Multiple Linear Regression - Principles

Notes and in-class exercises

You can download the .qmd file for this activity here and open in R-studio. The rendered version is posted in the course website (Activities tab). I often experiment with the class activities (and see it in live!) and make updates, but I always post the final version before class starts. To be sure you have the most up-to-date copy, please download it once you’ve settled in before class begins.

Notes

Learning goals

Working with multiple predictors in our plots and models can get complicated!

There are no recipes for this process.

BUT there are some guiding principles that assist in long-term retention, deeper understanding, and the ability to generalize our tools in new settings.

By the end of this lesson, you should be familiar with some general principles for…

  • incorporating additional quantitative or categorical predictors in a visualization
  • how additional quantitative or categorical predictors impact the physical representation of a model
  • interpreting quantitative or categorical coefficients in a multiple regression model

Readings and videos

Please watch the following video before class.

Please watch the following video that include extra examples after class.

Exercises

Let’s revisit the bikeshare data:

# Load packages & import data
library(tidyverse)
bikes <- read_csv("https://mac-stat.github.io/data/bikeshare.csv") %>% 
  rename(rides = riders_registered)

Our goal is to explain / predict how registered ridership varies from day to day.

To this end, we’ll build various multiple linear regression models of rides by different combinations of the possible predictors.

# Check out the data
head(bikes)
## # A tibble: 6 × 15
##   date       season  year month day_of_week weekend holiday temp_actual
##   <date>     <chr>  <dbl> <chr> <chr>       <lgl>   <chr>         <dbl>
## 1 2011-01-01 winter  2011 Jan   Sat         TRUE    no             57.4
## 2 2011-01-02 winter  2011 Jan   Sun         TRUE    no             58.8
## 3 2011-01-03 winter  2011 Jan   Mon         FALSE   no             46.5
## 4 2011-01-04 winter  2011 Jan   Tue         FALSE   no             46.8
## 5 2011-01-05 winter  2011 Jan   Wed         FALSE   no             48.7
## 6 2011-01-06 winter  2011 Jan   Thu         FALSE   no             47.1
## # ℹ 7 more variables: temp_feel <dbl>, humidity <dbl>, windspeed <dbl>,
## #   weather_cat <chr>, riders_casual <dbl>, rides <dbl>, riders_total <dbl>

Exercise 1: Review visualization

Let’s build a model of rides by windspeed (quantitative) and weekend status (categorical).

Plot & describe, in words, the relationship between these 3 variables.

bikes %>% 
  ggplot(aes(y = rides, x = windspeed, color = weekend)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Live-Note Taking!

Exercise 2: Review model

Let’s build the model. Run the following code:

bike_model_1 <- lm(rides ~ windspeed + weekend, data = bikes)
coef(summary(bike_model_1))
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 4738.38053  147.53653 32.116659 1.208405e-141
## windspeed    -63.97072   10.45274 -6.119997  1.528443e-09
## weekendTRUE -925.15701  119.86330 -7.718434  3.891082e-14

The estimated model formula is therefore:

E[rides | windspeed, weekendTRUE] = 4738.38 - 63.97 * windspeed - 925.16 * weekendTRUE

This model formula is represented by 2 lines, one corresponding to weekends and the other to weekdays. Simplify the model formula above for weekdays and weekends:

weekdays: rides = 4738.38 - 63.97 *windspeed

weekends: rides = (4738.38- 925.16) - 63.97 *windspeed

Exercise 3: Review coefficient interpretation

  1. The intercept coefficient, 4738.38, represents the intercept of the sub-model for weekdays, the reference category. What’s its contextual interpretation?

  2. The windspeed coefficient, -63.97, represents the shared slope of the weekend and weekday sub-models. What’s its contextual interpretation?

  3. The weekendTRUE coefficient, -925.16, represents the change in intercept for the weekend vs weekday sub-model. What’s its contextual interpretation?

Exercise 4: 2 categorical predictors – visualization

Thus far, we’ve explored a couple examples of multiple regression models that have 2 predictors, 1 quantitative and 1 categorical.

So what happens when both predictors are categorical?!

To this end, let’s model rides by weekend status and season.

The below code plots rides vs season.

Modify this code to also include information about weekend.

HINT: Remember the visualization principle that additional categorical predictors require some sort of grouping mechanism / mechanism that distinguishes between the 2 groups.

# rides vs season
bikes %>% 
  ggplot(aes(y = rides, x = season)) + 
  geom_boxplot()


# rides vs season AND weekend
bikes %>%
  ggplot(aes(y = rides, x = season, fill = weekend)) +
  geom_boxplot()

Exercise 5: follow-up

  1. Describe (in words) the relationship of ridership with season & weekend status.

  2. A model of rides by season alone would be represented by only 4 expected outcomes, 1 for each season. Considering this and the plot above, how do you anticipate a model of rides by season and weekend status will be represented?

    • 2 lines, 1 for each weekend status
    • 8 lines, 1 for each possible combination of season & weekend
    • 2 expected outcomes, 1 for each weekend status
    • 8 expected outcomes, 1 for each possible combination of season & weekend

Exercise 6: 2 categorical predictors – build the model

Let’s build the multiple regression model of rides vs season and weekend:

bike_model_2 <- lm(rides ~ weekend + season, bikes)
coef(summary(bike_model_2))
##                Estimate Std. Error     t value      Pr(>|t|)
## (Intercept)   4260.4492   99.16363  42.9638294 1.384994e-201
## weekendTRUE   -912.3324  103.23016  -8.8378473  7.298199e-18
## seasonspring  -116.3824  132.76018  -0.8766364  3.809741e-01
## seasonsummer   438.4424  132.06413   3.3199205  9.454177e-04
## seasonwinter -1719.0572  133.30505 -12.8956646  2.081758e-34

Thus estimated model formula is given by:

E[rides | weekend, season] = 4260.45 - 912.33 weekendTRUE - 116.38 seasonspring + 438.44 seasonsummer - 1719.06 seasonwinter

  1. Use this model to predict the ridership on the following days:
# a fall weekday
4260.45 - 912.33*0 - 116.38*0  + 438.44*0 - 1719.06*0
## [1] 4260.45

# a winter weekday    
4260.45 - 912.33*0 - 116.38*0  + 438.44*0 - 1719.06*1
## [1] 2541.39

# a fall weekend day        
4260.45 - 912.33*1 - 116.38*0  + 438.44*0 - 1719.06*0
## [1] 3348.12

# a winter weekend day
4260.45 - 912.33*1 - 116.38*0  + 438.44*0 - 1719.06*1
## [1] 1629.06
  1. We only made 4 predictions here. How many possible predictions does this model produce? Is this consistent with your intuition in the previous exercise?

Exercise 7: 2 categorical predictors – interpret the model

Use your above predictions and visualization to fill in the below interpretations of the model coefficients.

Hint: What is the consequence of plugging in 0 or 1 for the different weekend and season categories?

  1. Interpreting 4260: On average, we expect there to be 4260 riders on (weekdays/weekends) during the (fall/spring/summer/winter).

  2. Interpreting -912: On average, in any season, we expect there to be 912 (more/fewer) riders on weekends than on ___.

  3. Interpreting -1719: On average, on both weekdays and weekends, we expect there to be 1719 (more/fewer) riders in winter than in ___.

Exercise 8: 2 quantitative predictors – visualization

Next, consider the relationship between rides and 2 quantitative predictors: windspeed and temp_feel. Check out the plot of this relationship below which reflects the visualization principle that quantitative variables require some sort of numerical scaling mechanism – rides and windspeed get numerical axes, and temp_feel gets a color scale:

Modify the code below to recreate this plot.

bikes %>%
  ggplot(aes(y = rides, x = windspeed, color = temp_feel)) +
  geom_point() #color = temp_feel

OPTIONAL: Check out this interactive plot which allows us to explore this point cloud in 3D.

# Install the "plotly" package in R Console First!
library(plotly)
bikes %>% 
  plot_ly(x = ~windspeed, y = ~temp_feel, z = ~rides, 
          type = "scatter3d", 
          mode = "markers",
          marker = list(size = 5, color = ~rides, colorscale = "Viridis"))

Exercise 9: follow-up

Describe (in words) the relationship of ridership with windspeed & temperature.

Exercise 10: 2 quantitative predictors – modeling

Let’s build the multiple regression model of rides vs windspeed and temp_feel:

bike_model_3 <- lm(rides ~ windspeed + temp_feel, data = bikes)
coef(summary(bike_model_3))
##              Estimate Std. Error     t value     Pr(>|t|)
## (Intercept) -24.06464 299.303032 -0.08040225 9.359394e-01
## windspeed   -36.54372   9.408116 -3.88427585 1.119805e-04
## temp_feel    55.51648   3.330739 16.66791759 4.436963e-53

Thus estimated model formula is

E[rides | windspeed, temp_feel] = -24.06 - 36.54 windspeed + 55.52 temp_feel

  1. Interpret the intercept coefficient, -24.06, in context. Is this a correct interpretation?

  2. Interpret the windspeed coefficient, -36.54, in context.

  3. Interpret the temp_feel coefficient, 55.52, in context.

Exercise 11: Which is “best”?

We’ve now observed 3 different models of ridership, each having 2 predictors. The R-squared values of these models, along with those of the simple linear regression models with each predictor alone, are summarized below.

model predictors R-squared
bike_model_1 windspeed & weekend 0.119
bike_model_2 weekend & season 0.349
bike_model_3 windspeed & temp_feel 0.310
bike_model_4 windspeed 0.047
bike_model_5 temp_feel 0.296
bike_model_6 weekend 0.074
bike_model_7 season 0.279
  1. Which model does the best job of explaining the variability in ridership from day to day?

  2. If you could only pick one predictor, which would it be?

  3. What happens to R-squared when we add a second predictor to our model, and why does this make sense? For example, how does the R-squared for model 1 (with both windspeed and weekend) compare to those of model 4 (only windspeed) and model 6 (only weekend)?

  4. Are 2 predictors always better than 1? Provide evidence and explain why this makes sense.





Extra practice

The following exercises provide extra practice. If you don’t get to these during class, you’re encouraged to try them outside class.

Exercise 13: Practice 1

Consider the relationship of rides vs weekend and weather_cat.

  1. Construct a visualization of this relationship.
  2. Construct a model of this relationship.
  3. Interpret the first 3 model coefficients.

Exercise 14: Practice 2

Consider the relationship of rides vs temp_feel and humidity.

  1. Construct a visualization of this relationship.
  2. Construct a model of this relationship.
  3. Interpret the first 3 model coefficients.

Exercise 15: Practice 3

Consider the relationship of rides vs temp_feel and weather_cat.

  1. Construct a visualization of this relationship.
  2. Construct a model of this relationship.
  3. Interpret the first 3 model coefficients.

Exercise 16: CHALLENGE

We’ve explored models with 2 predictors. What about 3 predictors?! Consider the relationship of rides vs temp_feel, humidity, AND weekend.

  1. Construct a visualization of this relationship.
  2. Construct a model of this relationship.
  3. Interpret each model coefficient.





Solutions

Exercise 1: Review visualization

bikes %>% 
  ggplot(aes(y = rides, x = windspeed, color = weekend)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Exercise 2: Review model

weekdays: rides = 4738.38 - 63.97 windspeed

weekends: rides = 4738.38 - 63.97 windspeed - 925.16 = 3813.22 - 63.97 windspeed

Exercise 3: Review coefficient interpretation

  1. On average, there are 4738 riders on weekdays with 0 windspeed.

  2. On both weekends and weekdays, a 1mph increase in windspeed is associated with 64 fewer riders on average.

  3. At any windspeed, the average ridership is 925 lower on weekends than week days.

Exercise 4: 2 categorical predictors – visualization

bikes %>% 
  ggplot(aes(y = rides, x = season, fill = weekend)) + 
  geom_boxplot()

Exercise 5: follow-up

  1. In every season, ridership tends to be lower on weekends. Across weekend status, ridership tends to be highest in summer and lowest in winter.

  2. 8 expected outcomes

Exercise 6: 2 categorical predictors – build the model

#fall weekday:    
4260.45 - 912.33*0 - 116.38*0 + 438.44*0 - 1719.06*0
## [1] 4260.45

#winter weekday:
4260.45 - 912.33*0 - 116.38*0 + 438.44*0 - 1719.06*1
## [1] 2541.39

#fall weekend:    
4260.45 - 912.33*1 - 116.38*0 + 438.44*0 - 1719.06*0
## [1] 3348.12

#winter weekend:
4260.45 - 912.33*1 - 116.38*0 + 438.44*0 - 1719.06*1
## [1] 1629.06
  1. 8: 2 weekend categories * 4 season categories

Exercise 7: 2 categorical predictors – interpret the model

  1. We expect there to be, on average, 4260 riders on weekdays during the fall.
  2. On average, in any season, there are 912 fewer riders on weekends than on weekdays.
  3. On average, on both weekdays and weekends, we expect there to be 1719 fewer riders in winter than in fall.

Exercise 8: 2 quantitative predictors – visualization

bikes %>% 
  ggplot(aes(y = rides, x = windspeed, color = temp_feel)) + 
  geom_point() 

Exercise 9: follow-up

Ridership tends to increase with temperature (no matter the windspeed) and decrease with windspeed (no matter the temperature).

Exercise 10: 2 quantitative predictors – modeling

  1. -24.06 = average ridership on days with 0 windspeed and a 0 degree temperature. (Note: this is a correct interpretation, even though it doesn’t make conceptual sense! The model doesn’t know that ridership can’t be negative!)

  2. At any temperature, a 1mph increase in windspeed is associated with a 37 rider decrease in average ridership.

  3. At any windspeed, a 1 degree increase in temperature is associated with a 56 rider increase in average ridership.

Exercise 11: Which is best?

  1. model 2
  2. temperature
  3. R-squared increases (our model is stronger when we include another predictor)
  4. nope. model 1 (with windspeed and weekend) has a lower R-squared than model 5 (with only temperature)



Exercise 13: Practice 1

bikes %>% 
  ggplot(aes(y = rides, x = weekend, fill = weather_cat)) + 
  geom_boxplot()


new_model_1 <- lm(rides ~ weekend + weather_cat, bikes)
coef(summary(new_model_1))
##                     Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)        4211.8741   75.54724 55.751529 9.461947e-265
## weekendTRUE        -982.2106  117.24719 -8.377264  2.786301e-16
## weather_catcateg2  -608.8640  113.00211 -5.388077  9.628947e-08
## weather_catcateg3 -2360.2049  319.71640 -7.382183  4.270163e-13
  • The average ridership on a weekday with nice weather (categ1) is 4212 rides.

  • On days with the same weather, the average ridership is 982 less on a weekend than on a weekday.

  • On any day of the week, the average ridership is 609 less on dreary days than on nice days.



Exercise 14: Practice 2

bikes %>% 
  ggplot(aes(y = rides, x = temp_feel, color = humidity)) + 
    geom_point()


new_model_2 <- lm(rides ~ temp_feel + humidity, bikes)
coef(summary(new_model_2))
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)   315.83704 303.777334  1.039699 2.988249e-01
## temp_feel      60.43316   3.272315 18.468015 9.451345e-63
## humidity    -1868.99356 336.963661 -5.546573 4.078901e-08
  • On average, there are 316 riders on days with 0 humidity that feel like 0 degrees.

  • At any humidity, a 1 degree increase in ridership is associated with a 60 ride increase in average ridership.

  • At any temperature, a 1 percentage point increase in humidity (i.e. a 0.1 increase in humidity) is associated with a 187 decrease in average ridership.



Exercise 15: Practice 3

new_model_3 <- lm(rides ~ temp_feel + weather_cat, bikes)
coef(summary(new_model_3))
##                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)        -288.68840 251.264383 -1.148943 2.509574e-01
## temp_feel            55.30133   3.215495 17.198387 7.082670e-56
## weather_catcateg2  -386.42241 100.187725 -3.856984 1.249775e-04
## weather_catcateg3 -1919.01375 283.022420 -6.780430 2.481218e-11

bikes %>% 
  ggplot(aes(y = rides, x = temp_feel, color = weather_cat)) + 
  geom_point() + 
  geom_line(aes(y = new_model_3$fitted.values), size = 1.5)

  • On average, there are -289 riders on nice weather days that feel like 0 degrees. (Note: this is a correct interpretation, even though it doesn’t make conceptual sense!)

  • On days with the same weather, a 1 degree increase in temperature is associated with a 55 ride increase in average ridership.

  • On days with the same temperature, average ridership is 386 rides lower on a dreary weather day than on a nice weather day.

Exercise 16: CHALLENGE

bikes %>% 
  ggplot(aes(y = rides, x = temp_feel, color = humidity, size = weekend)) + 
  geom_point()


new_model_4 <- lm(rides ~ temp_feel + humidity + weekend, bikes)
coef(summary(new_model_4))
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)   668.60236 292.181063  2.288315 2.240530e-02
## temp_feel      59.36751   3.119256 19.032585 7.626695e-66
## humidity    -1906.43437 320.982938 -5.939364 4.433789e-09
## weekendTRUE  -869.05771 100.057822 -8.685555 2.471050e-17
  • On average, there are 669 riders on weekdays that feel like 0 degrees and have no humidity.

  • On days with the same humidity levels and time of the week, a 1 degree increase in temperature is associated with a 59 ride increase in average ridership.

  • On days with the same temperature and time of the week, a 0.1 point increase in humidity levels (i.e. a 1 percentage point increase in humidity) is associated with a 190.6 ride decrease in average ridership.

  • On days with the same temperature and humidity, the average ridership is 869 rides lower on weekends compared to weekdays.